A deep learning-based framework for retinal fundus image enhancement

Problem Low-quality fundus images with complex degredation can cause costly re-examinations of patients or inaccurate clinical diagnosis. Aim This study aims to create an automatic fundus macular image enhancement framework to improve low-quality fundus images and remove complex image degradation. Method We propose a new deep learning-based model that automatically enhances low-quality retinal fundus images that suffer from complex degradation. We collected a dataset, comprising 1068 pairs of high-quality (HQ) and low-quality (LQ) fundus images from the Kangbuk Samsung Hospital’s health screening program and ophthalmology department from 2017 to 2019. Then, we used these dataset to develop data augmentation methods to simulate major aspects of retinal image degradation and to propose a customized convolutional neural network (CNN) architecture to enhance LQ images, depending on the nature of the degradation. Peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), r-value (linear index of fuzziness), and proportion of ungradable fundus photographs before and after the enhancement process are calculated to assess the performance of proposed model. A comparative evaluation is conducted on an external database and four different open-source databases. Results The results of the evaluation on the external test dataset showed an significant increase in PSNR and SSIM compared with the original LQ images. Moreover, PSNR and SSIM increased by over 4 dB and 0.04, respectively compared with the previous state-of-the-art methods (P < 0.05). The proportion of ungradable fundus photographs decreased from 42.6% to 26.4% (P = 0.012). Conclusion Our enhancement process improves LQ fundus images that suffer from complex degradation significantly. Moreover our customized CNN achieved improved performance over the existing state-of-the-art methods. Overall, our framework can have a clinical impact on reducing re-examinations and improving the accuracy of diagnosis.

Introduction framework, comprising new processes for dataset construction, data augmentation, and a new model for supervised learning. To this end, we established a process to construct a dataset of LQ and HQ image pairs. LQ images contain various degradation, such as blur, haze, low illumination, and artifacts such as eyelashes or tears. Moreover, we include various abnormal images with diseases and normal images without disease within the LQ and HQ image pairs so that the framework is unbiased toward normal images. Based on this dataset, we propose a framework for data augmentation and a novel CNN structure that can enhance images depending on the degradation. We conducted comparative quantitative and qualitative evaluations using private and public datasets to demonstrate the effectiveness of the proposed method.
Overall, our main contributions are as follows: • We establish a unique training dataset that includes LQ and HQ image pairs, consisting of various abnormal features for major eye diseases, which differs from that of other studies that apply a single diagnosis (for example, diabetic retinopathy). We trained the framework to preserve all the clinically important features during the enhancement process because approximately 50% of our dataset has at least two or more diagnoses of diseases such as agerelated macular degeneration, diabetic retinopathy, and epiretinal membrane.
• We propose data augmentation methods to simulate major aspects of retinal image degradation, including blur, haze, and low illumination to reduce the limitations in the dataset collection.
• We present a customized CNN architecture that incorporate attention layers into the U-net structure, resulting in improved performance in quantitative and qualitative evaluations.

Deep learning-based methods for retinal fundus images
Recently, advanced deep learning-based systems have achieved significant performance in the grading and classification of retinal fundus images and in detecting specific landmarks (mainly vessels) or diseases, such as diabetic retinopathy. Several works [22][23][24][25] have proposed automatic retinal fundus image grading systems using a CNN as the backbone to generate feature vectors that are given as the input of a classifier. These methods may be the basis of more automated clinical procedures compared to existing traditional procedures for retinal diseases where doctors performed the jobs manually. A study [26] has shown that the extracted retinal image feature can be used as an input for recurrent neural networks to generate a detailed clinical description.
Many recent studies have stressed that using simple CNN architecture to extract features from retinal fundus images can effectively improve the performance of the vessel segmentation task [27][28][29][30][31]. Other studies [32,33] proposed to apply dilated convolution to overcome the limited information with a fixed-sized receptive field of conventional CNN architectures to better estimate the vessels in the retinal fundus image. In the work of Jiang et al. [34], a multiscale information fusion module is added to the dilated CNN architecture to enlarge the receptive field of the CNN.
Some studies have shown the effectiveness of using attention mechanisms with multiscale operations or enlarged receptive fields. Zhang et al. [35] proposed an attention-guided filter to recover spatial information and merge structural information from the various resolution levels by filtering the low-resolution feature maps with high-resolution feature maps. Jiang et al. [36] also proposed a residual attention module to highlight important areas in fundus images, filter noise from the background, and solve the problem of information loss caused by downsampling. In Mou et al. [37], both the 2-dimensional spatial attention and channel attention modules were used to enrich contextual dependencies over local feature representations, and exploit the interdependencies of channel maps, resulting in improved vessel segmentations.
Many other studies have particularly based on the U-Net [38] architecture. Gao et al. [39] formulated the vessel segmentation task as a multi-label problem and combined the Gaussian matched filter with U-Net to generate a blood vessel segmentation framework. Alom et al. [40] proposed the Recurrent CNN (RUNet) and Recurrent Residual CNN model (R2U-Net) architectures for segmentation tasks. Kamran et al. [41] proposed a multiscale generative architecture for accurate retinal vessel segmentation and to alleviate the inability of the decoder to recover lost information from the encoder of the U-Net.

Enhancement of retinal fundus images
Several methods have been proposed that recover details of the vessels or the macula from degraded LQ images by enhancing the brightness, contrast, or luminance of images. Zhou et al. [42] and Palanisamy et al. [43] revealed that luminance and contrast were improved with γ-correction and contrast-limited adaptive histogram equalization. Reddy et al. [44] used texture histogram equalization. Foracchia et al. [45] and Leahy et al. [46] proposed methods based on the estimation of degradation features, such as luminance, contrast, or illumination to achieve enhancement. Kubecka et al. [47] proposed the optimization of parameters of the Bspline shading model using Shannon's entropy. Mustafa et al. [48] proposed a normalization of the background image using a low pass filter and a gaussian filter. These methods are based on local pixel statistics, and applicable without prior learning from ground truth (GT) images. However, this also leads to limited adaptability or generalizability, depending on the complex degradation factors in the fundus image.
Many studies have also been proposed on fundus image enhancement using deep learning. Savelli et al. [49] devised a structurally serialized CNN for correcting illumination. Even with a simple CNN structure, information on degradation characteristics on the fundus image is adeptly inferred by understanding the relative context of the patch. Zhao et al. [50] proposed a GAN-based framework to enhance blurry fundus images. This GAN architecture does not require actual low-high-quality training image pairs, and is suitable when data is limited. However, the number of degradations that can be improved at one time is limited because the latent space in GAN is uninterpretable and unmanipulable.
Since deep learning-based methods require substantial training data, synthesized images can effectively supplement insufficient real training images [51]. Methods that model the degradation factors are thus relevant in this context. Hide [52] introduced an atmospheric scattering model to explain the formation of haze, and this was further developed by other studies [53,54]. Xiong et al. [55] modeled a blurry fundus image, using the atmospheric scattering model, suggesting a method for estimating the transmission map and background illuminance. Shi et al. [56] applied γ-correction to the model and improved image illumination.

CNN architectures with attention
Here, we review relevant CNN architectures to our customized attention-based CNN network. Attention within a CNN is an operation where the network learns to attend to particular feature values through adaptive scaling. Many attempts have been made to incorporate various attention mechanisms into networks [57,58]. The network learns to scale local features through spatial attention. The network also learns to scale particular feature channels, corresponding to important image characteristics, through channel attention.
Oktay et al. [59] and Li et al. [60] proposed network structures that combined a spatialattention module with U-Net [38]. These studies learned the relative importance of spatial between pixels of a feature map for performing segmentation of a target object in a medical image. Rundo et al. [61] confirmed the importance of channel-wise recalibration of the feature map in the segmentation task of MRI image, by inserting a Squeeze-and-Excitation (SE) module [62], which was a channel-attention within a U-Net. Studies also combined the spatialattention and channel attention parallelly [63,64] or sequentially [65]. Sun et al. [66] included a parallel spatial and channel attention structure in the skip connections between the encoder and decoder blocks in the U-Net. Zhao et al. [67] and Gu et al. [68] used a sequential spatial and channel attention structure. Zhao et al. [67] noted that a spatial-attention module was used at the network interface; whereas a channel-attention module was used to generate latent representations and reduce computational complexity. Gu et al. [68] placed channel-attention modules at every decoder block to learn to generate segmentation maps from the encoded latent representation.

Data preprocessing
Registration. Given that fundus image pairs for the same patient at different times are nonidentical due to the differences in camera viewpoint or patient pose, image registration is required to ensure the local correspondence of LQ and HQ images during network training. We used the SURF-PIIFD-RPM method, proposed by Wang et al. [69], using affine transformation or second-order polynomial transformation depending on the image, to perform robust alignment for the image with rotation and scale-invariant SURF feature points [70]. We manually annotated the corresponding points to guide the registration in the rare cases, where SURF key point matches were obtained incorrectly. Fig 1 shows the registration results of a sample image pair from the training dataset.
Patch generation. To adhere to the constraints in GPU memory, we used smaller patches of size 320 × 320 × 3, cropped from the original images. For training, we chose 5 patches around the macular, 10 patches around the crossing point of the vessels, and 5 patches randomly across the entire fundus image. We tested our network on non-overlapping tiled patches of the whole image.

PLOS ONE
Augmentation. We supplemented the limited number of images in our dataset using data augmentation. We considered five different augmentation factors: i) rotation, ii) linear interpolation, iii) blur, iv) haze, and v) illumination.
For rotation, we added three rotated versions of images with angles of 90˚, 180˚, and 270˚. With the additional rotations, the network can learn rotation-agnostic features, such as vessels or macular patterns, which must be consistently enhanced, invariant to image orientation.
For linear interpolation, we generated new LQ images, L I,new using linear interpolation between the LQ, L I and HQ, H I images as follows: where we assigned four different values for the scalar variable λ = (0.2, 0.4, 0.6, 0.8), which controls the degree of interpolation. This augmentation enables the network to consistently enhance images with intermediate qualities between the LQ and HQ images [71].
For the blur, we generated new LQ images L I,new using Gaussian blur [72] as follows: where H I is the patch from the original HQ image, and K is a gaussian kernel for convolution.
Here, we used a Gaussian blur kernel of size 5 × 5. For haze, we applied the atmospheric scattering model [52] to synthesize new LQ hazy images L I,new assuming a homogeneous transmission map and several manually crafted depth maps d(x), as shown in Fig 2. This model is formulated as follows: where t(x) is the transmission map; H I (x) is the original HQ image, and A is the atmospheric light vector in the RGB domain. We can assume that the transmission map is homogeneous, and t(x) is represented as follows: where β is the medium extinction coefficient, and d(x) is the depth between the objects and the camera.
Finally, for illumination, we used γ-correction, which is a nonlinear transformation that adjusts the brightness of the image [56] to generate the unevenly illuminated LQ image L I,new . This model is formulated as follows: where the γ value, in the range of 0 < γ < 1, darkens the image and simulates low illumination.

Proposed network architecture
Our customized network is a convolutional neural network (CNN) with an encoder-decoder structure similar to U-Net [38], as depicted in Fig 3. While this structure has been found to work well for general image enhancement [73], we include an additional layer that incorporates parallel operations within a channel attention framework, so that specific aspects of the enhancement corresponding to the given image can adaptively emphasized. The encoding and decoding blocks, denoted as EncBlks and DecBlks, respectively, have nearly identical structures, except for the first 3 × 3conv and 3 × 3deconv layers, because EncBlks must downscale the input size and DecBlks must upscale the downsampled input. We used the parallel layer and adaptive attention mechanisms to selectively apply suitable operations for the given input [74], as AttOpBlk.
We applied five parallel operations in AttOpBlk: {1 × 1conv, 3 × 3conv, 5 × 5conv, 7 × 7conv, 3 × 3maxpool}, and a channel-wise attention layer to compute the attention weight, indicating the importance of each operation. The attention layer computes the attention weight through a 3-Layer-MLP with a channel-wise average of the input feature map and finds the optimal operation to be used in the corresponding EncBlk and DecBlk, considering various factors such as feature map size, degradation factors, the severity of degradation, and layer depth. As shown in Fig 4, at AttOpBlk l, the attention weight A l is expressed as follows: where U l 2 R jOjjC l j is the learnable matrix; |O| is the number of operations in the attention layer; F r is the ReLU function, and C l is the per-channel spatial average of input X l as follows.
where H and W refer to the height and width of the input feature map X l , and c denotes the channel of the input feature map X l . We used the per-channel average as the input of the channel-wise attention layer because the absolute intensity of the pixel map of the input feature has a significant impact in determining the degradation factor and its severity. Subsequently, vector A is normalized into � A such that the sum of the elements of attention weight is 1, and Z l is the result of the element-wise multiplication of � A and Y l , the results of applying each operation in the operation set to the input feature map of the layer. This process is formulated as follows: where � denotes the element-wise multiplication, and Y l = O(X l ) is the result of applying operations in the operation set on the input feature map X l . The input feature map X l is concatenated with the sum of the Z l to retain the knowledge learned in the previous layer, This connection is also interpreted as a residual connection [75] between the input and output of the layer that enables the gradient to be propagated into the input of the layer through backpropagation. Finally, a 1 × 1conv operation is placed at the end of the layer to adjust the channel of the output feature map of the AttOpBlk, and the output of the AttOpBlk l, S l is computed as follows: where |O| is the number of operations in the operation set; F c denotes 1 × 1 convolution, and � denotes channel-wise concatenation of two matrices. As shown in Fig 3, the entire network is structured following a composition of EncBlks and DecBlks. The width and height of the feature map are downsampled from the image by 2 4 , and the feature dimension becomes 2 10 after the encoding portion in the first half of the network.

PLOS ONE
For example, a latent feature representation of size 20 × 20 × 1024 results from an input image of size 320 × 320 × 3. In the decoding portion of the second half of the network, the latent features are upsampled and reconfigured to become an output of size 320 × 320 × 3, identical to the input.
To train the network, we use the following loss function: where y is the output of the network;ŷ is the reference image; N batch is the number of images in the minibatch, and W net is the weight parameters of the network. The first term is the pixelwise difference term to supervise the network output to be similar to the ground truth (the HQ image), while the second term is the L2 norm for the trainable weights of the network, which is a commonly used regularization term [76]. We used the L1 distance for the pixel-wise difference. Unlike other tasks, L2 distance may over-penalize the values in pixels with uneven illumination [77], given that our training dataset contained numerous dark LQ images and bright HQ image pairs. The parameter λ, set at λ = 0.1, controls the relative importance between the two terms.

Datasets
We sampled the training dataset comprising 1068 pairs of LQ and HQ fundus photographs of patients, acquired from the Kangbuk Samsung Hospital Ophthalmology Department (KBSMC) between 2017 and 2019, and denoted this as the KBSMC dataset. LQ images were taken either in the health screening process or from a preoperative examination. Corresponding HQ images are from the same patient, acquired after pupil dilation or surgery, from which accurate diagnosis can be achieved. In Fig 5, we depict two examples from the KBSMC dataset where improvements in image quality facilitate better diagnosis. We can observe regions (in the red boxes) where lesions become visible in the HQ images (small round hole and drusen for the first and second example, respectively). The majority of eye diseases are found in the peripheral region of the retina. Thus, these examples show how the peripheral region of the retinal fundus image is as important as the central field, and how well our KBSMC dataset is designed to train our model for various degradations on the retinal fundus image.
For evaluation, we constructed a test dataset from images, acquired from the ophthalmology department of Seoul National University Hospital (SNUH), denoted as the SNUH dataset. This dataset comprised 68 pairs of fundus photographs collected before and after cataract surgery, of which 29 (42.6%) of the pre-surgery LQ images were ungradable. Here, all images were of a resolution of 2400 × 2400.
Since we were unable to share the private datasets due to privacy issues, we also used the publicly available DRIVE [78], STARE [79], CHASE_DB1 [80] and DIARETDB1 [81] datasets, comprising 40, 397, 28, and 89 images, respectively, as additional test datsaets. We chose the DRIVE [78], STARE [79] and CHASE_DB1 [80] datasets because they are commonly used by studies, focusing on retinal fundus images and the evaluation of retinal vessel segmentation methods. The DIARETDB1 [81] dataset was chosen because many of its images have poor illumination and thus are suitable for the proposed method.
This study adhered to the tenets of the Declaration of Helsinki, and the protocol was reviewed and approved by the Institutional Review Boards (IRB) of Kangbuk Samsung Hospital (No. KBSMC 2019-08-031) and Seoul National University Hospital (C-2007-003-1137). Our study is a retrospective of medical records, and our data were fully anonymized before processing. The IRB waived the requirement for informed consent.

Evaluation settings and metrics
Training was performed using the entire KBSMC dataset, whereas testing was performed on the external SNUH dataset and publicly available DRIVE [78], STARE [79], CHASE_DB1 [80], and DIARETDB1 [81] datasets. Additionally, we performed five-fold cross-validation on the KBSMC dataset to serve as a reference when there is no domain shift.
We used three metrics to assess the quality of the enhanced image and to evaluate the proposed framework: i) PSNR [82], ii) SSIM [83], iii) r (linear index of fuzziness) [84,85]. For the SNUH dataset, we also measure the proportion of ungradable fundus images before and after the enhancement process.
Both PSNR and SSIM are reference metrics, used to measure the quality when compared with the reference GT. PSNR may not correspond to human intuition of overall image quality given that PSNR is based solely on the pixel-wise mean-squared error (MSE) between the

PLOS ONE
output image and GT. For example, a blurred output may lead to a lower MSE than a similar but slightly misaligned texture for high-frequency texture details [86]. Thus, we also used SSIM, which measures degradation as the relative change in perceived structural information. r is independent of the GT and can be measured solely from the output image. We primarily applied this metric to the public datasets that lacked the GT HQ images to serve as references. For PSNR or SSIM, higher values indicate that the enhanced image is closer to the GT image; whereas, for r, a lower value indicates a less noisy image and thus better performance. (This metric is originally denoted as γ by [84,85]. However, we denote this as r to avoid confusion with the γ in γ-correction).
To measure ungradable images, we define LQ images as ungradable following Fleming et al. [87] as: i) Images in which the third-generation branches cannot be identified within one optic disc diameter of the macular. ii) Images with various artifacts. iii) Images in which at least one of the macular, optic disc, superior temporal arcade, or inferior temporal arcade are incomplete. iv) Images in which the diagnosis cannot be obtained because of the degradation.
We also conducted a comparative evaluation, where we presented the PSNR, SSIM, and r results of three different algorithms, developed by Zhou et al. [42], Gaudio et al. [88], and Dai et al. [89], respectively along with the P-values of the proposed method. Table 1 shows the quantitative comparative evaluations of the KBSMC and SNUH test datasets, demonstrating that the proposed method achieves the best results for both datasets. We also compared the change in the proportion of ungradable fundus photographs with the SNUH dataset, based on our method. Among the 68 images from the SNUH datsaet, the

Evaluation of public datasets
We applied our trained model to four public datasets (DRIVE [78], STARE [79], CHASE_DB1 [80] and DIARETDB1 [81]) to demonstrate how effectively the proposed data augmentation

PLOS ONE
method synthesized various degradations, and how our pre-trained model improved the LQ image, sampled from the out-of-distribution datasets. Table 2 shows the quantitative evaluation of each dataset, based on the average r values, revealing whether P-values are within the level of statistical significance of 0.001. Although the proposed method produces the lowest r values for the DRIVE [78], STARE [79] and CHAS-E_DB1 [80] datasets, it produces a higher r value than the Gaudio et al. [88] method for the DIARETDB1 [81] dataset. This could be associated with the characteristics of Gaudio et al. [88] method, which maximizes the underlying pattern of the fundus image after amplifying the pixel color. However, the image may be unrealistic after drastically altering the appearance of the original image. Figs 8-Fig 11 provide qualitative comparisons of sample images from the DRIVE [78], STARE [79], CHASE_DB1 [80] and DIARETDB1 [81] datasets, respectively. The proposed method improves the image, makes its content visible more clearly, and minimizes unwanted changes.

Implementation details
For the hyperparameters, we used a mini-batch size of 16, an initial learning rate of α = 0.01 and decay rate of 0.9, as shown by [75] for 1000 epochs, each of which has approximately 300 iterations.
For comparative evaluations of three algorithms, we implemented our version based on the Zhou et al. [42] and Dai et al. [89] methods, following their descriptions of network architecture and hyperparameter settings. Moreover, we used the official implementation of Gaudio et al. [88] with the sA + sB + SC+ sX option.
Each experiment with different datasets using our CNN-based network is performed on a single NVIDIA GeForce GTX 2080Ti GPU, which takes about 0.91 second per 320 × 320 × 3 scaled image, while the three algorithms developed by Zhou et al. [42], Gaudio et al. [88], and Dai et al. [89] were evaluated on a single Intel Xeon Gold 6248R CPU.

Limitations
The proposed image enhancement framework is beneficial for most ungradable fundus images. However, two main limitations are identified that must be addressed. First is the

PLOS ONE
accuracy of GT images. Although all corresponding LQ and HQ fundus photograph pairs are from the same patients, factors that can be detrimental when determining the GT fundus image are the time interval between image acquisitions, the differences in positions or angles, inconsistent alignment between LQ and HQ fundus images after registration, and ungradable images or images with unknown diagnoses. We addressed these issues by i) minimizing the time interval between image pair acquisitions, ii) attaining accurate registration using the SURF-PIIFD-RPM method [69], and iii) using fundus images with a confirmed ophthalmic diagnosis, following a dilated fundus examination conducted by ophthalmologists. The disparity in the training and test datasets' characteristics or the domain shift between datasets is the second limitation. The PSNR and SSIM values for the SNUH test dataset, in which the training and test datasets are from different domains, are lower than those of the KBSMC test dataset, which is from the same domain with training samples. In Fig 12, we show

PLOS ONE
the failure cases of the SNUH test dataset. These examples illustrate the limitations of our image enhancement framework because the input image has a very low illumination. We note that the SNUH test set contains more severe cases of ungradable fundus images compared with the KBSMC dataset. Thus, the proposed framework may not work for test images with degradations, with different or more severe than those of the training images.

Clinical application
Experimental results on the SNUH dataset demonstrate that the proposed method can be used to reduce ungradable images. Thus, we plan to apply our method to images acquired during health screening. Our goal is to reduce unnecessary re-examinations and save the patient's time, money, and effort. Our framework can increase the diagnostic accuracy for LQ fundus photography, crucial for the ophthalmologist.
Our framework can also be used as a preprocessing step in other automated tasks, such as retinal vessel segmentation. Thus, the clarity of retinal blood vessels improves considerably after enhancement, as well as the results of vessel segmentation. Fig 13 depicts examples where the vessel segmentation are improved, using the iterative pixel thresholding method [90]. Two sampled ungradable fundus images and corresponding segmentation results also improved after enhancement.

Conclusion
This study proposed a comprehensive framework for deep learning image enhancement of fundus images, comprising dataset collection, data augmentation, and customized network architecture. Pairs of LQ with many image degradation factors and corresponding HQ images were collected under a protocol, including clinical diagnosis by ophthalmologists and detailed analysis of the enhancement effect on pathological features within the fundus images. Based on our novel dataset, we proposed an optimal CNN structure for retinal fundus image enhancement that could effectively handle complex degradation factors with an attention module. The proposed framework was evaluated on internal and external validation datasets,

PLOS ONE
as well as on DRIVE [78], STARE [79], CHASE_DB1 [80] and DIARETDB1 [81] databases. Among various poor image etiologies, our study provides a significant improvement in reducing the proportion of ungradable fundus photographs. Overall, our work is expected to have a clinical impact by lowering the rate of re-examinations among patients and by improving the accuracy of diagnosis.